Performance of Large Language Models on a Neurology Board–Style Examination

Key Points
Question What is the performance of large language models on neurology board–style examinations?
Findings In this cross-sectional study, a newer version of the large language model significantly outperformed the mean human score when given questions from a question bank approved by the American Board of Psychiatry and Neurology, answering 85.0% of questions correctly compared with the mean human score of 73.8%, while the older model scored below the human average (66.8%). Both models used confident or very confident language, even when incorrect.
Meaning These findings suggest that with further refinements, large language models could have significant applications in clinical neurology.

The chi-squared test was used to calculate p-values. P-values were adjusted for multiple testing using the Bonferroni correction.
eMethods.
To test for memorization of the questions and answers, we performed a series of analyses that follow common approaches for analyzing memorization in large language models 1,2. There, memorization is defined as the "ability to generate the true continuation when choosing the most likely token at every step of decoding" 1. Tokens are defined as words or groups of characters that appear in a text. To increase the probability of retrieving potential memorizations, we set the model temperature, which is a measure of how deterministic or random the output will be, to 0, to obtain the most likely and least random continuation.
In both LLM 1 and LLM 2, the beginning a was correctly continued in 0 of the 1956 tested multiple-choice problems.
Second, we performed a further analysis, analogous to a method introduced by Biderman et al 2. For this, a score is defined based on the "number of ordered matching tokens" between the true continuation and the model's continuation c, where the number of matching tokens is divided by the number of tokens in the true continuation. A memorized sequence will have a score equal to 1. In both LLM 1 and LLM 2, the calculated scores were not equal to 1 (LLM 1: mean = 0.14, SD = 0.1, n = 1956; LLM 2: mean = 0.14, SD = 0.1, n = 1956). As it is known that larger models tend to memorize faster 3, we hypothesized that if the models were trained on the questions, LLM 2 should be able to memorize better than LLM 1. Both models performed equally poorly, with no significant difference (paired t test, p = 0.64, t statistic = -0.46, 95% CI [-0.005, 0.003], n = 1956, mean of the differences: -0.001).
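The score and the paired comparison described above can be sketched as follows. This is a minimal illustration and not the authors' code: it assumes that "ordered matching tokens" means tokens that agree at the same position in both continuations, and the function names `memorization_score` and `paired_t_statistic` are hypothetical.

```python
import math
import statistics
from typing import Sequence


def memorization_score(true_tokens: Sequence[str], model_tokens: Sequence[str]) -> float:
    """Share of positions at which the model's token equals the true token.

    Assumption: tokens are compared position by position; the count of matches
    is divided by the number of tokens in the true continuation.
    """
    if not true_tokens:
        raise ValueError("the true continuation must contain at least one token")
    matches = sum(1 for t, m in zip(true_tokens, model_tokens) if t == m)
    return matches / len(true_tokens)


def paired_t_statistic(scores_1: Sequence[float], scores_2: Sequence[float]) -> float:
    """t statistic of a paired t test on per-question score differences."""
    if len(scores_1) != len(scores_2) or len(scores_1) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    diffs = [x - y for x, y in zip(scores_1, scores_2)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error
```

A score of 1 would require every position of the true continuation to be reproduced; the p-value for the t statistic would be read from a t distribution with n - 1 degrees of freedom, which this sketch leaves out.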
Third, we analyzed the portion of tokens that matched between the prediction and the true continuation. In LLM 2, the 10 most frequently matching tokens accounted for 45.46% of all matching tokens (top 10 matching tokens: " .", " the", " of", " ,", " is", " and", " D", " C", " in", " to"). The occurrence of the tokens " D" and " C" is indicative of the model's ability to continue a list if prompted with a beginning string of "A), … B)". Similar results were observed in LLM 1 (top 10 matching tokens: " .", " the", " of", " ,", " is", " and", " C", " D", " in", " patient", representing 46.73% of all matching tokens). The remaining tokens comprised a diverse range of terms from the medical context. The complete count tables for the matching tokens for both models are found in Supplementary Material 2.
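The token-level tally described above can be sketched as follows. This is an illustration under assumptions, not the authors' code: matches are counted position by position, and the function name `top_matching_share` is hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def top_matching_share(
    pairs: Iterable[Tuple[Sequence[str], Sequence[str]]], k: int = 10
) -> Tuple[List[Tuple[str, int]], float]:
    """Count matching tokens over all problems and report the top-k share.

    Each pair holds (true continuation tokens, model continuation tokens).
    Returns the k most frequent matching tokens with their counts, and the
    fraction of all matching tokens that these k tokens account for.
    """
    counts: Counter = Counter()
    for true_tokens, model_tokens in pairs:
        for t, m in zip(true_tokens, model_tokens):
            if t == m:
                counts[t] += 1
    total = sum(counts.values())
    top = counts.most_common(k)
    share = sum(c for _, c in top) / total if total else 0.0
    return top, share
```

A high share concentrated in punctuation, function words, and option letters would point to generic continuation of the prompt format rather than recall of specific question content.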

eFigure 5. Confidence of Language in Correctly and Incorrectly Answered Questions
Bar plot visualizing self-assessed confidence on a Likert scale by LLM 1 and LLM 2. Questions are colored based on whether they were answered correctly (N = 1956).

eFigure 6. Length of Question Between Incorrectly and Correctly Answered Questions Between LLM 1, LLM 2, and Question Bank Users Separately, Primary vs High-Order Question Percentage
Left: Comparison of question length between correctly and incorrectly answered questions for GPT3.5 and GPT4 (N = 1956 questions).
Right: Correlation plot between the percentage of users correctly answering each question and the question length, with one dot representing a single question. Questions are colored based on the percentage of users correctly answering the question (N = 1956).

eFigure 7. High-Dimensional tSNE Analyses of Question and Answer Embeddings
T-SNE analysis of calculated embeddings of questions, with each question represented by a single dot and colored based on its related topic.

First, in accordance with Carlini et al., we split each multiple-choice problem s into the beginning of the problem a and the true continuation b. We then gave the beginning a to each model to retrieve the model's continuation c, while setting the maximum number of tokens that the model should return to the number of tokens of the true continuation b.
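The splitting step described above can be sketched as follows. This is a minimal sketch under stated assumptions: the source does not say where each problem was split or which tokenizer was used, so `split_index`, the whitespace-style token lists, and the `generate` callable (with its stand-in `echo_model`) are all hypothetical.

```python
from typing import Callable, List, Tuple


def split_problem(tokens: List[str], split_index: int) -> Tuple[List[str], List[str]]:
    """Split problem s into beginning a and true continuation b.

    Assumption: the split point is supplied by the caller.
    """
    if not 0 < split_index < len(tokens):
        raise ValueError("split_index must leave both parts non-empty")
    return tokens[:split_index], tokens[split_index:]


def continue_beginning(
    a: List[str], b: List[str], generate: Callable[[List[str], int], List[str]]
) -> List[str]:
    """Ask the model for continuation c, capped at the token length of b.

    `generate` is a hypothetical stand-in for a model call made at
    temperature 0; it receives the prompt tokens and a maximum token count.
    """
    c = generate(a, len(b))
    return c[: len(b)]


def is_exact_continuation(b: List[str], c: List[str]) -> bool:
    """True only if the model reproduced the true continuation token for token."""
    return list(b) == list(c)


def echo_model(prompt_tokens: List[str], max_tokens: int) -> List[str]:
    """Toy model used only for demonstration; returns a canned answer."""
    canned = ["B)", "Stroke", "C)", "Migraine"]
    return canned[:max_tokens]
```

Capping the output at the length of b keeps c and b directly comparable, so an exact match can be checked without trimming or padding.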
eTable 1. Performance of LLM 1 and LLM 2 on the EBN Question Samples Cohort
Example of a Correctly Answered Lower-Order Question and Answers by LLM 1 and LLM 2
3. Example of an Incorrectly Answered Lower-Order Question and Answers by LLM 1 and LLM 2
4. Example of an Incorrectly Answered Higher-Order Question and Answers by LLM 1 and LLM 2

© 2023 Schubert MC et al. JAMA Network Open.

eFigure 1.
symmetric. MRI of the lumbar spine showed degenerative disc disease at all lumbar levels, with slight bilateral foraminal stenosis at L3 and L4, but no herniated disc. Results of an electromyogram were normal. His cerebrospinal fluid showed a predominantly lymphocytic pleocytosis (110 cells/L), a raised protein level (570 mg/L) and intrathecal Ig-synthesis. Which diagnosis most likely explains the clinical features of this patient?
A: Neuroborreliosis.
B: Neurosarcoidosis.
C: Inflammatory plexopathy.
D: Postherpetic neuralgia.
E: Wartenberg syndrome.

A: Neuroborreliosis. It's worth noting that Lyme disease is often associated with a characteristic "bull's-eye" rash (erythema migrans), but it may not appear in all cases, and the disease can still progress to neuroborreliosis without it. Further diagnostic testing, such as serological tests for antibodies against Borrelia burgdorferi, would be needed to confirm the diagnosis.

A: Internal carotid artery stenosis.
B: Migrainous infarction.
C: Reversible vasoconstriction syndrome.
D: Posterior reversible encephalopathy syndrome.
E: Vertebral artery dissection.

B: Migrainous infarction.